In [1]:
!pip install -r requirements.txt
Obtaining analyst from git+ssh://git@github.com/CartoDB/kart-oku.git#egg=analyst&subdirectory=api-client (from -r requirements.txt (line 1))
  Cloning ssh://git@github.com/CartoDB/kart-oku.git to ./src/analyst
Obtaining cartoframes from git+ssh://git@github.com/CartoDB/cartoframes.git@linear_legend#egg=cartoframes (from -r requirements.txt (line 2))
  Cloning ssh://git@github.com/CartoDB/cartoframes.git (to revision linear_legend) to ./src/cartoframes
Branch 'linear_legend' set up to track remote branch 'linear_legend' from 'origin'.
Switched to a new branch 'linear_legend'
Obtaining tiletanic from git+ssh://git@github.com/DigitalGlobe/tiletanic.git#egg=tiletanic (from -r requirements.txt (line 3))
  Cloning ssh://git@github.com/DigitalGlobe/tiletanic.git to ./src/tiletanic
Requirement already satisfied: geopandas in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 4)) (0.4.0)
Requirement already satisfied: matplotlib in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 5)) (3.0.1)
Requirement already satisfied: pandas in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 6)) (0.23.4)
Requirement already satisfied: shapely in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 7)) (1.6.4.post1)
Requirement already satisfied: seaborn in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 8)) (0.9.0)
Requirement already satisfied: sklearn in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 9)) (0.0)
Requirement already satisfied: mercantile in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 10)) (1.0.4)
Requirement already satisfied: pyproj in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 11)) (1.9.5.1)
Requirement already satisfied: pygeotile in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 12)) (1.0.6)
Requirement already satisfied: matplotlib_venn in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 13)) (0.11.5)
Requirement already satisfied: scikit-gstat in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 14)) (0.2.3)
Requirement already satisfied: xgboost in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 15)) (0.81)
Requirement already satisfied: h2o in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 16)) (3.22.1.4)
Requirement already satisfied: ipywidgets in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 17)) (7.4.2)
Requirement already satisfied: missingno in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 18)) (0.4.1)
Requirement already satisfied: eli5 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 20)) (0.8.2)
Requirement already satisfied: jupyter in /Users/giulia/anaconda3/lib/python3.7/site-packages (from -r requirements.txt (line 21)) (1.0.0)
Requirement already satisfied: requests in /Users/giulia/anaconda3/lib/python3.7/site-packages (from analyst->-r requirements.txt (line 1)) (2.21.0)
Requirement already satisfied: webcolors>=1.7.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from cartoframes->-r requirements.txt (line 2)) (1.8.1)
Requirement already satisfied: carto>=1.4.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from cartoframes->-r requirements.txt (line 2)) (1.4.0)
Requirement already satisfied: tqdm>=4.14.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from cartoframes->-r requirements.txt (line 2)) (4.31.1)
Requirement already satisfied: appdirs>=1.4.3 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from cartoframes->-r requirements.txt (line 2)) (1.4.3)
Requirement already satisfied: IPython>=6.0.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from cartoframes->-r requirements.txt (line 2)) (7.1.1)
Requirement already satisfied: click in /Users/giulia/anaconda3/lib/python3.7/site-packages (from tiletanic->-r requirements.txt (line 3)) (7.0)
Requirement already satisfied: geojson in /Users/giulia/anaconda3/lib/python3.7/site-packages (from tiletanic->-r requirements.txt (line 3)) (2.4.1)
Requirement already satisfied: fiona in /Users/giulia/anaconda3/lib/python3.7/site-packages (from geopandas->-r requirements.txt (line 4)) (1.7.12)
Requirement already satisfied: numpy>=1.10.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from matplotlib->-r requirements.txt (line 5)) (1.15.4)
Requirement already satisfied: cycler>=0.10 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from matplotlib->-r requirements.txt (line 5)) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from matplotlib->-r requirements.txt (line 5)) (1.0.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from matplotlib->-r requirements.txt (line 5)) (2.3.0)
Requirement already satisfied: python-dateutil>=2.1 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from matplotlib->-r requirements.txt (line 5)) (2.6.0)
Requirement already satisfied: pytz>=2011k in /Users/giulia/anaconda3/lib/python3.7/site-packages (from pandas->-r requirements.txt (line 6)) (2018.7)
Requirement already satisfied: scipy>=0.14.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from seaborn->-r requirements.txt (line 8)) (1.1.0)
Requirement already satisfied: scikit-learn in /Users/giulia/anaconda3/lib/python3.7/site-packages (from sklearn->-r requirements.txt (line 9)) (0.20.1)
Requirement already satisfied: nose in /Users/giulia/anaconda3/lib/python3.7/site-packages (from scikit-gstat->-r requirements.txt (line 14)) (1.3.7)
Requirement already satisfied: numba in /Users/giulia/anaconda3/lib/python3.7/site-packages (from scikit-gstat->-r requirements.txt (line 14)) (0.40.0)
Requirement already satisfied: colorama>=0.3.8 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from h2o->-r requirements.txt (line 16)) (0.4.0)
Requirement already satisfied: tabulate in /Users/giulia/anaconda3/lib/python3.7/site-packages (from h2o->-r requirements.txt (line 16)) (0.8.3)
Requirement already satisfied: future in /Users/giulia/anaconda3/lib/python3.7/site-packages (from h2o->-r requirements.txt (line 16)) (0.17.1)
Requirement already satisfied: ipykernel>=4.5.1 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from ipywidgets->-r requirements.txt (line 17)) (5.1.0)
Requirement already satisfied: nbformat>=4.2.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from ipywidgets->-r requirements.txt (line 17)) (4.4.0)
Requirement already satisfied: traitlets>=4.3.1 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from ipywidgets->-r requirements.txt (line 17)) (4.3.2)
Requirement already satisfied: widgetsnbextension~=3.4.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from ipywidgets->-r requirements.txt (line 17)) (3.4.2)
Requirement already satisfied: attrs>16.0.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from eli5->-r requirements.txt (line 20)) (18.2.0)
Requirement already satisfied: typing in /Users/giulia/anaconda3/lib/python3.7/site-packages (from eli5->-r requirements.txt (line 20)) (3.6.6)
Requirement already satisfied: graphviz in /Users/giulia/anaconda3/lib/python3.7/site-packages (from eli5->-r requirements.txt (line 20)) (0.10.1)
Requirement already satisfied: six in /Users/giulia/anaconda3/lib/python3.7/site-packages (from eli5->-r requirements.txt (line 20)) (1.11.0)
Requirement already satisfied: jinja2 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from eli5->-r requirements.txt (line 20)) (2.10)
Requirement already satisfied: nbconvert in /Users/giulia/anaconda3/lib/python3.7/site-packages (from jupyter->-r requirements.txt (line 21)) (5.3.1)
Requirement already satisfied: jupyter-console in /Users/giulia/anaconda3/lib/python3.7/site-packages (from jupyter->-r requirements.txt (line 21)) (6.0.0)
Requirement already satisfied: notebook in /Users/giulia/anaconda3/lib/python3.7/site-packages (from jupyter->-r requirements.txt (line 21)) (5.7.2)
Requirement already satisfied: qtconsole in /Users/giulia/anaconda3/lib/python3.7/site-packages (from jupyter->-r requirements.txt (line 21)) (4.4.2)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from requests->analyst->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from requests->analyst->-r requirements.txt (line 1)) (2.7)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from requests->analyst->-r requirements.txt (line 1)) (1.23)
Requirement already satisfied: certifi>=2017.4.17 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from requests->analyst->-r requirements.txt (line 1)) (2018.11.29)
Requirement already satisfied: pyrestcli>=0.6.4 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from carto>=1.4.0->cartoframes->-r requirements.txt (line 2)) (0.6.7)
Requirement already satisfied: jedi>=0.10 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (0.13.1)
Requirement already satisfied: appnope; sys_platform == "darwin" in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (0.1.0)
Requirement already satisfied: backcall in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (0.1.0)
Requirement already satisfied: pexpect; sys_platform != "win32" in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (4.6.0)
Requirement already satisfied: setuptools>=18.5 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (40.6.2)
Requirement already satisfied: pygments in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (2.2.0)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (2.0.7)
Requirement already satisfied: decorator in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (4.0.11)
Requirement already satisfied: pickleshare in /Users/giulia/anaconda3/lib/python3.7/site-packages (from IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (0.7.5)
Requirement already satisfied: cligj>=0.4 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from fiona->geopandas->-r requirements.txt (line 4)) (0.5.0)
Requirement already satisfied: click-plugins in /Users/giulia/anaconda3/lib/python3.7/site-packages (from fiona->geopandas->-r requirements.txt (line 4)) (1.0.4)
Requirement already satisfied: munch in /Users/giulia/anaconda3/lib/python3.7/site-packages (from fiona->geopandas->-r requirements.txt (line 4)) (2.3.2)
Requirement already satisfied: llvmlite>=0.25.0dev0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from numba->scikit-gstat->-r requirements.txt (line 14)) (0.25.0)
Requirement already satisfied: tornado>=4.2 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->-r requirements.txt (line 17)) (5.1.1)
Requirement already satisfied: jupyter-client in /Users/giulia/anaconda3/lib/python3.7/site-packages (from ipykernel>=4.5.1->ipywidgets->-r requirements.txt (line 17)) (5.2.3)
Requirement already satisfied: jupyter-core in /Users/giulia/anaconda3/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->-r requirements.txt (line 17)) (4.4.0)
Requirement already satisfied: ipython-genutils in /Users/giulia/anaconda3/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->-r requirements.txt (line 17)) (0.2.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from nbformat>=4.2.0->ipywidgets->-r requirements.txt (line 17)) (2.6.0)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from jinja2->eli5->-r requirements.txt (line 20)) (1.1.0)
Requirement already satisfied: mistune>=0.7.4 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from nbconvert->jupyter->-r requirements.txt (line 21)) (0.8.4)
Requirement already satisfied: bleach in /Users/giulia/anaconda3/lib/python3.7/site-packages (from nbconvert->jupyter->-r requirements.txt (line 21)) (3.0.2)
Requirement already satisfied: entrypoints>=0.2.2 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from nbconvert->jupyter->-r requirements.txt (line 21)) (0.2.3)
Requirement already satisfied: testpath in /Users/giulia/anaconda3/lib/python3.7/site-packages (from nbconvert->jupyter->-r requirements.txt (line 21)) (0.4.2)
Requirement already satisfied: pandocfilters>=1.4.1 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from nbconvert->jupyter->-r requirements.txt (line 21)) (1.4.2)
Requirement already satisfied: terminado>=0.8.1 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from notebook->jupyter->-r requirements.txt (line 21)) (0.8.1)
Requirement already satisfied: pyzmq>=17 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from notebook->jupyter->-r requirements.txt (line 21)) (17.1.2)
Requirement already satisfied: Send2Trash in /Users/giulia/anaconda3/lib/python3.7/site-packages (from notebook->jupyter->-r requirements.txt (line 21)) (1.5.0)
Requirement already satisfied: prometheus-client in /Users/giulia/anaconda3/lib/python3.7/site-packages (from notebook->jupyter->-r requirements.txt (line 21)) (0.4.2)
Requirement already satisfied: parso>=0.3.0 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from jedi>=0.10->IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (0.3.1)
Requirement already satisfied: ptyprocess>=0.5 in /Users/giulia/anaconda3/lib/python3.7/site-packages (from pexpect; sys_platform != "win32"->IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (0.6.0)
Requirement already satisfied: wcwidth in /Users/giulia/anaconda3/lib/python3.7/site-packages (from prompt-toolkit<2.1.0,>=2.0.0->IPython>=6.0.0->cartoframes->-r requirements.txt (line 2)) (0.1.7)
Requirement already satisfied: webencodings in /Users/giulia/anaconda3/lib/python3.7/site-packages (from bleach->nbconvert->jupyter->-r requirements.txt (line 21)) (0.5.1)
Installing collected packages: analyst, cartoframes, tiletanic
  Found existing installation: analyst 0.0.1
    Uninstalling analyst-0.0.1:
      Successfully uninstalled analyst-0.0.1
  Running setup.py develop for analyst
  Found existing installation: cartoframes 0.9.2
    Uninstalling cartoframes-0.9.2:
      Successfully uninstalled cartoframes-0.9.2
  Running setup.py develop for cartoframes
  Found existing installation: tiletanic 0.0.6
    Uninstalling tiletanic-0.0.6:
      Successfully uninstalled tiletanic-0.0.6
  Running setup.py develop for tiletanic
Successfully installed analyst cartoframes tiletanic
In [2]:
import importlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from IPython.display import display, HTML, clear_output

from cartoframes.contrib import vector
from cartoframes import Credentials, CartoContext, QueryLayer

import cf
from cf.do import qk2mercatorpolygon
import pandas as pd

import seaborn as sns
import numpy as np
import missingno as msno
import re

from analysis.twin import draw_boxplot_df, draw_hist, draw_corrm

import ipywidgets as widgets
def view_variable(changed):
    clear_output()
    display(dropdown)
    variable = changed['new']
    color = f'ramp(prop("{variable}"), Temps)'
    display(cf.viz.map(nyc_wework_locations_ta_augmented_gdf, 
                       cc, 
                       color=color,
                       legend=variable,
                       interactivity=variable))
    nyc_wework_locations_ta_augmented_gdf[variable].plot.hist()

def entities2df(entities):
    return pd.DataFrame.from_records([o.to_record() for o in entities])

Introduction

WeWork is a company that provides shared office spaces. Currently WeWork offers 58 offices in NYC but only 21 in LA. Given a WeWork office location in NYC, we want to combine the different data sources stored in CARTO's Data Observatory (DO) to find locations in LA that show similar characteristics. The results of this similarity analysis indicate the best spots where WeWork could open a new office, or relocate an existing one, with characteristics similar to those of the target location in NYC.

The notebook is organized as follows:

  1. Accessing CARTO's DO and exploring the DO functionality
  2. Implementing the Similarity Analysis

Technical note: main connectors

cf.do.grid_cells(polygon, variables, zoom) --> from a target polygon, retrieve the grid cells enriched with data from the DO (at a given zoom level)

cf.do.augment(polygons, variables, zoom) --> enrich each polygon with data from the DO (at a given zoom level)

cf.load.grid_data(input_file, fields, zooms) --> interpolate custom data onto the common DO grid (at a given zoom level). Here 'fields' needs both the variable name and the aggregation function required, e.g. fields={'txs_count': 'sum'}.
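
As a rough illustration of the aggregation step behind fields={'txs_count': 'sum'} (the quadkeys and column names below are made up, and the real connector also handles reprojection and interpolation onto the grid), custom data can be collapsed onto grid cells like this:

```python
import pandas as pd

# Hypothetical custom data: transactions tagged with the quadkey of
# the DO grid cell that contains them (quadkeys are invented).
txs = pd.DataFrame({
    'quadkey': ['02301', '02301', '02310', '02310', '02310'],
    'txs_count': [3, 5, 1, 2, 4],
})

# fields={'txs_count': 'sum'} --> aggregate each variable onto the grid
# with the requested function, one row per grid cell.
fields = {'txs_count': 'sum'}
grid = txs.groupby('quadkey').agg(fields)
print(grid)  # 02301 -> 8, 02310 -> 7
```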

1. Accessing CARTO's DO and exploring the catalog

Environment preparation

We set up the CARTOframes credentials and context.

In [27]:
creds = Credentials(username='', key='')
cc = CartoContext(creds=creds)

Data Observatory Catalog discovery: access the data

Let's explore which regions we have data for in the US (DO & CARTOframes)

In [4]:
regions = cf.catalog.list_regions(search_term='US')
regions_df = entities2df(regions)
cf.viz.map(regions_df, cc,color='opacity(red, .5)')
Out[4]:

Let's explore what spatial resolutions are available for the selected regions (DO)

In [5]:
spatial_resolutions_ny = entities2df(regions[1].list_spatial_resolutions())
spatial_resolutions_ny['region'] = 'NY'
spatial_resolutions_la  = entities2df(regions[2].list_spatial_resolutions())
spatial_resolutions_la['region'] = 'LA'

print('Available spatial resolutions:')
spatial_resolutions = pd.merge(
    spatial_resolutions_ny[['description','region']],
    spatial_resolutions_la[['description','region']],
    on='description',
    how='outer')

spatial_resolutions['region'] = spatial_resolutions[['region_x', 'region_y']].values.tolist()
spatial_resolutions.drop(['region_x', 'region_y'], axis=1, inplace=True)

spatial_resolutions
Available spatial resolutions:
Out[5]:
description region
0 76 meters resolution. Quadkey grid level 19 [NY, LA]
1 305 meters resolution. Quadkey grid level 17 [NY, LA]

Let's see a summary of the data available for the region of NY for a given spatial resolution: providers and packages (DO)

In [6]:
ny = regions[1]
spatial_resolutions = ['305 meters']
providers = ny.list_providers(spatial_resolutions=spatial_resolutions)
providers_df = entities2df(providers)
providers_df[['description','id','name']].sort_values(by = 'id')
Out[6]:
description id name
4 Open demographics data from CARTO Data Observa... 1 CARTO
6 Unica360 demographics data for Spain 2 Unica360
7 POIs 3 Pitney Bowes
8 POIs and highways 4 Open Street Maps
9 Mastercard financial data for US 5 Mastercard
0 BBVA financial data for Spain 6 BBVA
1 Unacast mobility data for US 7 Unacast
2 Emodo mobility data for US 8 Emodo
3 Vodafone mobility data for US 9 Vodafone
5 TomTom traffic data for Spain 10 TomTom

Example 1: POIs from Pitney Bowes tagged as food.

In [7]:
selected_variables = ['Food']
selected_poi_attributes = cf.catalog.list_attributes(regions=['NY'], spatial_resolutions=spatial_resolutions, tags=selected_variables)
selected_poi_attributes =  pd.DataFrame.from_records([o.__dict__ for o in selected_poi_attributes])
selected_poi_attributes['spatial_resolutions'] = spatial_resolutions[0]
selected_poi_attributes[['description','source_description','data_since','data_until','temporal_resolutions','agg']].head()
Out[7]:
description source_description data_since data_until temporal_resolutions agg
0 Number of accomodation POIs, it may include da... Aggregated grid data for POIs None sum
1 Number of education POIs, it may include data ... Aggregated grid data for POIs None sum
2 Number of entertainment POIs, it may include d... Aggregated grid data for POIs None sum
3 Number of entertainment POIs, it may include d... Aggregated grid data for POIs None sum
4 Number of food POIs, it may include data from ... Aggregated grid data for POIs None sum

Example 2: Population and income data from the DO.

In [8]:
selected_variables = ['Population', 'Income']
selected_demographics_attributes = cf.catalog.list_attributes(regions=['NY'], spatial_resolutions=spatial_resolutions, tags=selected_variables)
selected_demographics_attributes =  pd.DataFrame.from_records([o.__dict__ for o in selected_demographics_attributes])
selected_demographics_attributes['spatial_resolutions'] = spatial_resolutions[0]
selected_demographics_attributes[['description','source_description','data_since','data_until','temporal_resolutions','agg']].head()                                                 
Out[8]:
description source_description data_since data_until temporal_resolutions agg
0 Median income Aggregated grid data for demographics None median
1 Median rent Aggregated grid data for demographics None median
2 Total population Aggregated grid data for demographics None sum

Data Observatory Catalog discovery: visualize the data

Let's compute 10-minute walking trade areas around the WeWork locations in NY and visualize a summary of the available data (DO & CARTOframes)

We first retrieve the list of WeWork NYC buildings (already stored in the DO).

In [9]:
query = "SELECT * FROM wework_buildings WHERE primary_market = 'New York City'"
wework_nyc_locations_df = cc.query(query, decode_geom=True)
wework_nyc_locations_df[['name', 'address', 'is_open','min_office_price']].head(10)
Out[9]:
name address is_open min_office_price
cartodb_id
1 109 S 5th St 109 S 5th Street Brooklyn NY 11249 True $750
27 261 Madison Ave 261 Madison Avenue New York NY 10016 True $890
2 110 Wall St 110 Wall St New York NY 10005 True $890
3 115 Broadway 115 Broadway Street 5th Floor New York NY 10006 True $980
4 11 Park Pl 11 Park Place New York NY 10007 True $930
5 120 E 23rd St 120 E 23rd St New York NY 10010 True $950
6 125 W 25th St 125 West 25th Street New York NY 10001 True $2,470
7 12 E 49th St 12 East 49th Street 11th floor New York NY 10017 True $1,020
8 134 N 4th St 134 N 4th St. Brooklyn NY 11249 True $1,010
9 135 E 57th St 135 E 57th Street New York NY 10022 True $1,190

Then, we define the spatial resolution of the grid used for the analysis

In [10]:
zoom = 17

and the variables we want to display.

In [11]:
variables = cf.catalog.list_variables(categories='all')
# Keep only the US variables (exclude the Spain-only 'es_*' categories)
es_categories = {'es_demographics', 'es_mobility', 'es_financial',
                 'es_traffic', 'es_poi'}
selected_variables = [v for v in variables if v.category not in es_categories]

We then compute the trade areas and augment them using the data in the DO for a given spatial resolution.

In [12]:
nyc_wework_locations_ta_gdf = cf.lds.isochrone(wework_nyc_locations_df,
                                               mode='walk', time=600)
nyc_wework_locations_ta_augmented_gdf = cf.do.augment(nyc_wework_locations_ta_gdf,
                                                      variables=selected_variables, z=zoom)

Finally, we can visualize a summary of the data for each variable.

In [13]:
initial_variable = selected_variables[0].id
dropdown = widgets.Dropdown(
    options=[v.id for v in selected_variables],
    value=initial_variable,
    description='Variable'
)

dropdown.observe(view_variable, names='value')
view_variable({'new': initial_variable})

2. Implementing the similarity analysis

Overview of the method

Based on the existing NYC WeWork locations, we want to find locations in LA that are characterized by "similar" values of the selected variables. The general approach is to compute the one-to-one distance in variable space

\begin{equation*} d\left(\mathbf{Y}_{origin}(i),\mathbf{Y}_{target}(f)\right) = \sqrt{\sum_j\left(Y_{origin,j}(i) - Y_{target,j}(f)\right)^2} \end{equation*}

between a selected origin cell ($i$) and each target cell ($f$), where the index $j$ identifies the variable type (e.g. total population).
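
This is the plain Euclidean distance between the cells' feature vectors. A minimal numpy sketch (the arrays below are illustrative, not DO data):

```python
import numpy as np

# One origin cell and three target cells, described by the same
# j-indexed variables (e.g. total population, median income, ...).
y_origin = np.array([3.0, 1.0, 2.0])
y_target = np.array([
    [3.0, 1.0, 2.0],   # identical to the origin -> distance 0
    [4.0, 1.0, 2.0],
    [0.0, 5.0, 2.0],
])

# d(i, f) = sqrt(sum_j (Y_origin_j(i) - Y_target_j(f))^2), one value per target cell
distances = np.sqrt(((y_target - y_origin) ** 2).sum(axis=1))
print(distances)  # -> [0. 1. 5.]
```

The most similar target cells are the ones with the smallest distance.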

Let's define the zoom level for the analysis

First, we define the spatial resolution of the grid used for the analysis.

In [14]:
zoom = 17

Let's define the variables we want to use in the similarity analysis (DO)

These are user-defined variables, which will depend on the particular application. For example, for this use case we do not want to include the road-type-related POIs, otherwise the results may be influenced by the different road layouts of LA and NY.

In [15]:
variables = cf.catalog.list_variables(categories='all')
# Exclude the Spain-only 'es_*' categories, then filter each US category
es_categories = {'es_demographics', 'es_mobility', 'es_financial',
                 'es_traffic', 'es_poi'}
selected_variables = []
for v in variables:
    if v.category in es_categories:
        continue
    if v.category == 'demographics':
        selected_variables.append(v)
    elif v.category == 'mobility_emodo':
        if re.search('emd_all_all', v.name):
            selected_variables.append(v)
    elif v.category == 'poi':
        # drop highway-related POIs (road layouts differ between LA and NY)
        if not re.search('highway', v.name):
            selected_variables.append(v)
    elif v.category == 'financial':
        if v.name[:3] in ('acc', 'ret', 'eap', 'gro', 'btn') \
        and re.search('(spend_amt|trans_cnt|acct_cnt)$', v.name):
            selected_variables.append(v)
pd.DataFrame(selected_variables, columns=['variables'])
Out[15]:
variables
0 demographics__total_population
1 demographics__median_income
2 demographics__median_rent
3 poi__accomodation
4 poi__education
5 poi__entertainment
6 poi__finance
7 poi__food
8 poi__health
9 poi__office
10 poi__shop
11 poi__transport
12 poi__employee_here
13 poi__sales_volume
14 poi__employee_count
15 mobility_emodo__emd_all_all
16 financial__acc_index_weighted_spend_amt
17 financial__acc_index_weighted_trans_cnt
18 financial__acc_index_weighted_acct_cnt
19 financial__btn_index_weighted_spend_amt
20 financial__btn_index_weighted_trans_cnt
21 financial__btn_index_weighted_acct_cnt
22 financial__eap_index_weighted_spend_amt
23 financial__eap_index_weighted_trans_cnt
24 financial__eap_index_weighted_acct_cnt
25 financial__gro_index_weighted_spend_amt
26 financial__gro_index_weighted_trans_cnt
27 financial__gro_index_weighted_acct_cnt
28 financial__ret_index_weighted_spend_amt
29 financial__ret_index_weighted_trans_cnt
30 financial__ret_index_weighted_acct_cnt

Let's define the origin and target cells (DO & CARTOframes)

In [16]:
origin_locations = wework_nyc_locations_df['geometry']

target_polygon = cf.catalog.named_entity('Los Angeles')
from shapely.geometry import box
cf.viz.map(target_polygon, cc, color='opacity(red, .2)')
Out[16]:

Let's preview some of the cells we are going to use for the analysis (CARTOframes)

In [17]:
target_polygon_variables = cf.do.grid_cells(target_polygon, selected_variables, zoom)
target_polygon_variables['geometry'] = [qk2mercatorpolygon(qk) for qk in target_polygon_variables.index]
variable_id = selected_variables[0].id
print(variable_id)
cf.viz.map(target_polygon_variables.filter(regex='^02301231113', axis=0), cc,
           strokeWidth=.5, color=f'opacity(ramp(prop("{variable_id}"), Temps), .5)',
           interactivity=variable_id)
do__demographics__total_population
Out[17]:

Preliminary analysis

To properly compute distances, we need to take a few issues into account:

  • What happens when the variables have different variances? In that case, distances in total population are given a different weight than distances in the number of food POIs.

  • What happens when there are correlations between variables? Correlated variables carry redundant information into the distance computation. By taking the covariance between the variables into account, we can remove that redundancy.

  • How do we compute distances in the presence of missing data?
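
The first point can be made concrete with a toy example (the numbers are made up): a variable with a much larger variance dominates the raw Euclidean distance, while after standardization both variables contribute on comparable scales.

```python
import numpy as np

# Two variables on very different scales: total population (thousands)
# and number of food POIs (units).
population = np.array([1200.0, 1500.0, 900.0, 2000.0])
food_pois  = np.array([3.0, 10.0, 1.0, 7.0])
X = np.column_stack([population, food_pois])

# Raw distance between cells 0 and 1 is almost entirely population.
raw = np.sqrt(((X[0] - X[1]) ** 2).sum())

# After standardizing to zero mean / unit variance, both variables
# contribute on the same scale.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
std = np.sqrt(((Xs[0] - Xs[1]) ** 2).sum())
print(raw, std)
```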

Do variables have different variances?

Note: the y-scale is logarithmic!

In [18]:
fig, axes = plt.subplots(figsize=(15, 7.5))
draw_boxplot_df(target_polygon_variables, axes, fig, [v.id for v in selected_variables])
plt.yscale("log")

Are variables correlated?

This plot shows the correlation matrix for the selected variables. Matrix elements with larger values imply stronger correlations.

In [19]:
draw_corrm(target_polygon_variables[[v.id for v in selected_variables]])

Are there any missing data?

We can then look at the proportion of missing data for each selected variable.

In [20]:
msno.bar(target_polygon_variables[[v.id for v in selected_variables]])
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a22222128>

If a POI variable is missing for a given location we can assume that the number of POIs for that variable in that location is zero. On the other hand, for other variables (e.g. demographics) we need to deal with missing values.
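
That rule is a one-liner in pandas (the column names below are illustrative): POI columns are filled with zeros, while the other columns keep their missing values for separate treatment.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'poi__food': [4.0, np.nan, 2.0],
    'poi__shop': [np.nan, 1.0, 3.0],
    'demographics__median_income': [52000.0, np.nan, 61000.0],
})

# A missing POI count means "no POIs observed here" -> fill with 0;
# missing demographics stay NaN and are handled separately.
poi_cols = [c for c in df.columns if c.startswith('poi__')]
df[poi_cols] = df[poi_cols].fillna(0)
print(df)
```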

Bringing it all together (DO & CARTOframes & DS)

Compute distances in variable space

\begin{equation*} d\left(\mathbf{Y}_{origin}(i),\mathbf{Y}_{target}(f)\right) = \sqrt{\sum_j\left(Y_{origin,j}(i) - Y_{target,j}(f)\right)^2} \end{equation*}

  • To account for DIFFERENT VARIANCES, the data are standardized to zero mean and unit variance. This also implies that all variables are given the same weight when computing distances.

  • To account for CORRELATED VARIABLES, the distances are not computed in the original variable space but in the $\textbf{principal component (PC) space}$, where the variables are linearly uncorrelated. PCA is also used to reduce the noise in the data by retaining only a subset of PCs. To account for the uncertainty in the number of retained PCs, we create an ensemble of distance computations and, for each ensemble member, we randomly set the number of retained PCs within the range defined by the number of PCs explaining 90% and 100% of the variance respectively. For more details on how PCA works, see the box below.

  • To account for MISSING DATA, we use a probabilistic approach to PCA called $\textbf{Probabilistic PCA (PPCA)}$. In PPCA the complete data are modelled by a generative latent variable model which iteratively updates the expected complete-data log-likelihood and the maximum likelihood estimates of the parameters. PPCA also has the advantage of being faster than PCA because it does not require computing the eigen-decomposition of the data covariance matrix. For more details see the box below and http://www.jmlr.org/papers/volume11/ilin10a/ilin10a.pdf
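
The first two steps and the PC ensemble can be sketched with numpy alone (synthetic, complete data; the real analysis uses PPCA to handle the missing values, while this sketch uses an SVD-based PCA):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: rows are grid cells, columns are variables.
# Standardize to zero mean and unit variance (the DIFFERENT VARIANCES step).
X = rng.normal(size=(200, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the centered data matrix (the CORRELATED VARIABLES step).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)
p_min = int(np.searchsorted(explained, 0.90) + 1)  # PCs explaining 90% of variance
p_max = len(s)                                     # PCs explaining 100% of variance

# Ensemble of distance computations with a random number of retained PCs.
origin = X[0]                             # one origin cell
n_members = 20
dist_sum = np.zeros(X.shape[0])
for _ in range(n_members):
    p = rng.integers(p_min, p_max + 1)    # random number of retained PCs
    Z = X @ Vt[:p].T                      # scores in PC space
    z0 = origin @ Vt[:p].T
    dist_sum += np.sqrt(((Z - z0) ** 2).sum(axis=1))
dist = dist_sum / n_members               # ensemble-mean distance to the origin
print(dist[:5])
```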

PCA in a nutshell

Principal component analysis (PCA) is a technique for transforming data sets of high-dimensional vectors into lower-dimensional ones. This is useful, for instance, for feature extraction and noise reduction. PCA finds a lower-dimensional linear representation of the data vectors such that the original data can be reconstructed from the compressed representation with the minimum square error.

The most popular approach to PCA is based on the eigen-decomposition of the sample covariance matrix

\begin{equation*} \mathbf{C} = \dfrac{1}{n} \, \mathbf{Y}^T \, \mathbf{Y} \end{equation*}

where $\mathbf{Y}$ is the centered (column-wise zero empirical mean) $n \times d$ data matrix, with $n$ the number of data points and $d$ the number of variables. After computing the eigenvector ($\mathbf{U}$) and eigenvalue ($\mathbf{D}$) matrices of the covariance matrix

\begin{equation*} \mathbf{C} = \mathbf{U} \, \mathbf{D} \, \mathbf{U}^T \end{equation*}

and rearranging their columns in order of decreasing eigenvalue, the principal components (PC or factors) are computed as

\begin{equation*} \mathbf{Z} = \mathbf{Y} \, \mathbf{P} \end{equation*}

where $\mathbf{P} = \mathbf{U}_p$ contains the eigenvectors corresponding to the largest $p$ eigenvalues, i.e. to the largest amount of explained variance. The original (centered) data can then be reconstructed as

\begin{equation*} \mathbf{\hat{Y}} = \mathbf{Z} \, \mathbf{P}^T \end{equation*}
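
The recipe above can be sketched directly in NumPy. The data here are synthetic and exactly rank $p$, so the rank-$p$ reconstruction recovers them up to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 200, 6, 2            # n data points, d variables, p retained PCs

# Centered data matrix Y (n x d), built from a rank-p structure.
Y = rng.normal(size=(n, p)) @ rng.normal(size=(p, d))
Y = Y - Y.mean(axis=0)

# Sample covariance C = (1/n) Y^T Y and its eigen-decomposition C = U D U^T.
C = (Y.T @ Y) / n
eigval, U = np.linalg.eigh(C)
order = np.argsort(eigval)[::-1]            # decreasing eigenvalue
eigval, U = eigval[order], U[:, order]

# Principal components Z = Y P with P = U_p, and reconstruction Ŷ = Z P^T.
P = U[:, :p]
Z = Y @ P
Y_hat = Z @ P.T

print(np.allclose(Y, Y_hat, atol=1e-8))     # True: rank-p data, p PCs kept
```

With real (full-rank, noisy) data the reconstruction is only approximate, and the discarded trailing eigenvalues measure the variance treated as noise.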

PPCA in a nutshell

PCA can also be described as the maximum likelihood solution of a probabilistic latent variable model, which is known as PPCA:

\begin{equation*} Y_{ij} = \mathbf{P}_i \, \mathbf{Z}_j + m_i + \varepsilon_{ij} \quad i = 1, .., d; \, j = 1, .., n \end{equation*}

with

\begin{align} \mathbf{Z}_j \sim N(0, \mathbb{1}) \\ \varepsilon_{ij} \sim N(0, \nu) \end{align}

Both the principal components $\mathbf{Z}$ and the noise $\varepsilon$ are assumed normally distributed. The model can be identified by finding the maximum likelihood (ML) estimate of the model parameters with the Expectation-Maximization (EM) algorithm, which minimizes the mean-square error of the observed part of the data. EM is a general framework for learning parameters from incomplete data which iteratively updates the expected complete-data log-likelihood and the maximum likelihood estimates of the parameters. In PPCA, the data are incomplete because the principal components $\mathbf{Z}_j$ are not observed and are treated as latent variables. When missing data are present, in the E-step the expectation of the complete-data log-likelihood is taken with respect to the conditional distribution of the latent variables given the observed variables. In this case, the EM update rules are the following.

  1. E-step: \begin{align} \mathbf{\Sigma}_{\mathbf{Z}_j} = \nu \left(\nu \, \mathbb{1} + \sum_i \mathbf{P}_i \mathbf{P}_i^T \right)^{-1} \\ \overline{\mathbf{Z}}_j = \dfrac{1}{\nu}\mathbf{\Sigma}_{\mathbf{Z}_j} \sum_i \mathbf{P}_i \left(Y_{ij}- m_i \right) \\ m_{i} = \dfrac{1}{n} \sum_j \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j \right) \\ \end{align}

  2. M-step: \begin{align} \mathbf{P}_{i} = \left( \sum_j \left( \overline{\mathbf{Z}}_j \overline{\mathbf{Z}}_j^T + \mathbf{\Sigma}_{\mathbf{Z}_j} \right) \right)^{-1} \sum_j \overline{\mathbf{Z}}_j \, \left(Y_{ij}- m_{i} \right)\\ \nu = \dfrac{1}{n} \sum_{ij} \left[ \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j - m_i \right)^2 + \mathbf{P}_i^T \, \mathbf{\Sigma}_{\mathbf{Z}_j} \mathbf{P}_i \, \right] \end{align}

where, in the presence of missing data, each row of $\mathbf{P}$ and each $\overline{\mathbf{Z}}_j$ is recomputed using only the observed entries of the corresponding row or column of the data matrix.
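
A compact NumPy sketch of these EM updates on toy data with missing entries (scalar noise variance $\nu$, loop-based E-step for clarity; the data, the 20% missingness, and the iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, p = 8, 100, 2

# Rank-p ground truth plus noise; ~20% of the entries are hidden at random.
Y_full = rng.normal(size=(d, p)) @ rng.normal(size=(p, n)) + 0.1 * rng.normal(size=(d, n))
mask = rng.random((d, n)) > 0.2              # True where Y_ij is observed
Y = np.where(mask, Y_full, np.nan)

P = rng.normal(size=(d, p))                  # loadings, one row per variable
m = np.zeros(d)                              # mean vector
nu = 1.0                                     # scalar noise variance

for _ in range(50):
    # E-step: posterior moments of each latent Z_j given its observed rows.
    Zbar = np.zeros((p, n))
    Sigma = np.zeros((n, p, p))
    for j in range(n):
        o = mask[:, j]
        Po = P[o]                            # rows P_i with Y_ij observed
        Sigma[j] = nu * np.linalg.inv(nu * np.eye(p) + Po.T @ Po)
        Zbar[:, j] = Sigma[j] @ Po.T @ (Y[o, j] - m[o]) / nu
    # Mean update over the observed entries of each row.
    m = np.nanmean(np.where(mask, Y - P @ Zbar, np.nan), axis=1)
    # M-step: each row of P uses only the samples where that row is observed.
    for i in range(d):
        o = mask[i]
        A = Zbar[:, o] @ Zbar[:, o].T + Sigma[o].sum(axis=0)
        P[i] = np.linalg.solve(A, Zbar[:, o] @ (Y[i, o] - m[i]))
    resid = np.where(mask, Y - P @ Zbar - m[:, None], 0.0)
    extra = np.einsum('ik,jkl,il->ij', P, Sigma, P)   # P_i^T Sigma_j P_i
    nu = (resid**2 + np.where(mask, extra, 0.0)).sum() / mask.sum()

# Reconstruction, including imputed values for the missing entries.
Y_hat = P @ Zbar + m[:, None]
```

`Y_hat` fills in the missing entries, which is exactly what lets the similarity analysis compute distances even when some DO variables are unavailable for a cell.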

In [21]:
similarity_predictors = selected_variables
similarity = cf.analysis.similarity(
    origin_locations,
    target_polygon,
    similarity_predictors,
    z=zoom
)
Number of retained PCA components: min
13
Number of retained PCA components: max
26
Number of ensemble members
10

Rank distances in variable space: when is a "small" distance small enough? Or: how do we define similarity?

So far we have computed distances in variable space between a given WeWork location in NY and each location in the target area in LA. We can then select the target locations with the smallest distances as the best candidates for opening or relocating WeWork offices in LA. However, how do we know that a distance is small enough, i.e. how do we define similarity?

To answer this question, we defined a similarity skill score (SS) that compares the score of each target location to the score obtained from the mean vector data.

\begin{equation*} {SS}_{target}(f) = 1 - \dfrac{{S}_{target}(f)}{{S}_{target}(\widehat f)} \end{equation*}

where the score just represents the distance in variable space, i.e.

\begin{equation*} {S}_{target}(f) \equiv d\left(\mathbf{Y}_{origin}(i),\mathbf{Y}_{target}(f)\right) =: d \end{equation*}

If we account for the uncertainty in the computation of the distance for each target location via the ensemble generation, the score for each target location becomes

\begin{equation*} {S}_{target}(f) = \dfrac{1}{K} \sum_k d_k - \dfrac{1}{2K (K-1)} \sum_k \sum_l \left| d_k- d_l \right| \end{equation*}

where K is the number of ensemble members.

  • the Skill Score (SS) is positive if and only if the target location is more similar to the origin than the mean vector data
  • a target location with a larger SS is more similar to the origin under this scoring rule

We can then order the target locations in decreasing order of SS and retain only those that satisfy a threshold condition (SS = 1 meaning a perfect match, i.e. zero distance). More information on scoring rules can be found here: https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/qj.2270.
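
The scoring and ranking can be sketched as follows. The ensemble distances here are random toy numbers (in the notebook they come from the PPCA ensemble); the function and array names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy ensemble: K distance estimates for each of 200 target cells, plus the
# K estimates of the score of the mean vector data (the common reference).
K = 10
d_target = rng.gamma(shape=4.0, scale=1.5, size=(200, K))
d_ref = rng.gamma(shape=6.0, scale=1.5, size=K)

def ensemble_score(d):
    """CRPS-style score: mean distance minus half the mean pairwise spread."""
    K = d.shape[-1]
    spread = np.abs(d[..., :, None] - d[..., None, :]).sum(axis=(-2, -1))
    return d.mean(axis=-1) - spread / (2 * K * (K - 1))

# Skill score relative to the mean vector data.
ss = 1 - ensemble_score(d_target) / ensemble_score(d_ref)

# Rank target cells by decreasing SS and keep the positive ones, i.e. the
# cells more similar to the origin than the mean vector data.
ranked = np.argsort(ss)[::-1]
candidates = ranked[ss[ranked] > 0]
```

The second term of the score rewards ensembles that agree with each other, so a cell with a small but uncertain distance is ranked below one with a comparable but more confident distance.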

Results (DO & CARTOframes)

Having computed the one-to-one distances in variable space for all WeWork locations in NY, we can then visualize the results.

In [22]:
manhattan_office = wework_nyc_locations_df.iloc[6]
print(manhattan_office['address'])

brooklyn_office = wework_nyc_locations_df.iloc[0]
print(brooklyn_office['address'])
125 West 25th Street New York NY 10001
109 S 5th Street Brooklyn NY 11249
In [23]:
origin_location = brooklyn_office['geometry']
print(brooklyn_office['address'])
cf.viz.map(origin_location, cc)
109 S 5th Street Brooklyn NY 11249
Out[23]:

Let's list the most similar locations in LA

In [24]:
similar_cells_df = similarity.twin_area(origin_location)
similar_cells_df[similar_cells_df.id_type!='origin'].head()
Out[24]:
do__demographics__total_population do__demographics__median_income do__demographics__median_rent do__poi__accomodation do__poi__education do__poi__entertainment do__poi__finance do__poi__food do__poi__health do__poi__office ... distance_ens8 distance_mean_ens8 distance_ens9 distance_mean_ens9 distance_diff_ens distance_diff_mean distance similarity_rank_score_ens similarity_rank_score_mean similarity_rank_score
id
02301231112133302 553.49 17769.16 946.33 0.0 0.0 0.0 0.0 0.0 2.0 0.0 ... 5.454714 8.939779 5.477291 8.989512 16.484853 42.778843 5.68 5.584769 9.321443 0.40
02301231113203033 1283.78 12634.06 803.84 1.0 0.0 0.0 0.0 0.0 0.0 3.0 ... 5.252791 8.939779 5.553080 8.989512 65.112413 42.778843 6.27 5.903402 9.321443 0.37
02301231113200103 1010.37 23828.27 983.73 0.0 0.0 1.0 3.0 1.0 0.0 2.0 ... 5.136235 8.939779 5.274208 8.989512 96.999376 42.778843 6.57 6.027712 9.321443 0.35
02301231113001331 483.54 30410.75 1434.88 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 6.073591 8.939779 6.063791 8.989512 8.391437 42.778843 6.17 6.126325 9.321443 0.34
02301231113200101 1194.95 16002.26 954.18 0.0 1.0 0.0 4.0 2.0 0.0 0.0 ... 5.175525 8.939779 5.476078 8.989512 112.820001 42.778843 6.75 6.127751 9.321443 0.34

5 rows × 59 columns

Upload the results to CARTO and visualize them on a map with a Builder embed centered on the origin location

In [25]:
top_locations = similar_cells_df[similar_cells_df.id_type!='origin'].iloc[:1000]
top_locations['geometry'] = [qk2mercatorpolygon(qk) for qk in list(top_locations.index)]
cc.write(top_locations, 'wework_la_top_locations', overwrite=True)
The following columns were changed in the CARTO copy of this dataframe:
do__demographics__total_population -> do_demographics_total_population
do__demographics__median_income -> do_demographics_median_income
do__demographics__median_rent -> do_demographics_median_rent
do__poi__accomodation -> do_poi_accomodation
do__poi__education -> do_poi_education
do__poi__entertainment -> do_poi_entertainment
do__poi__finance -> do_poi_finance
do__poi__food -> do_poi_food
do__poi__health -> do_poi_health
do__poi__office -> do_poi_office
do__poi__shop -> do_poi_shop
do__poi__transport -> do_poi_transport
do__poi__employee_here -> do_poi_employee_here
do__poi__sales_volume -> do_poi_sales_volume
do__poi__employee_count -> do_poi_employee_count
do__mobility_emodo__emd_all_all -> do_mobility_emodo_emd_all_all
do__financial__acc_index_weighted_spend_amt -> do_financial_acc_index_weighted_spend_amt
do__financial__acc_index_weighted_trans_cnt -> do_financial_acc_index_weighted_trans_cnt
do__financial__acc_index_weighted_acct_cnt -> do_financial_acc_index_weighted_acct_cnt
do__financial__btn_index_weighted_spend_amt -> do_financial_btn_index_weighted_spend_amt
do__financial__btn_index_weighted_trans_cnt -> do_financial_btn_index_weighted_trans_cnt
do__financial__btn_index_weighted_acct_cnt -> do_financial_btn_index_weighted_acct_cnt
do__financial__eap_index_weighted_spend_amt -> do_financial_eap_index_weighted_spend_amt
do__financial__eap_index_weighted_trans_cnt -> do_financial_eap_index_weighted_trans_cnt
do__financial__eap_index_weighted_acct_cnt -> do_financial_eap_index_weighted_acct_cnt
do__financial__gro_index_weighted_spend_amt -> do_financial_gro_index_weighted_spend_amt
do__financial__gro_index_weighted_trans_cnt -> do_financial_gro_index_weighted_trans_cnt
do__financial__gro_index_weighted_acct_cnt -> do_financial_gro_index_weighted_acct_cnt
do__financial__ret_index_weighted_spend_amt -> do_financial_ret_index_weighted_spend_amt
do__financial__ret_index_weighted_trans_cnt -> do_financial_ret_index_weighted_trans_cnt
do__financial__ret_index_weighted_acct_cnt -> do_financial_ret_index_weighted_acct_cnt
Table successfully written to CARTO: https://do-v2-demo.carto.com/dataset/wework_la_top_locations
In [26]:
HTML('''
<iframe width="100%" height="600" frameborder="0" src="https://do-v2-demo.carto.com/builder/13c6479a-6cf3-4ab9-af7b-25e34106d1eb/embed" allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>
''')    
Out[26]: